Project: SRP033351
Aligner: STAR (2.5.2b)
Genome: For human, the hg38 assembly was used. We estimate the number of rRNA reads as those mapped to chrM plus chrUn_GL000220v1, corresponding to 12S, 16S and 5.8S rRNA. The ‘Other’ category contains all other chr*_random and chrUn_* contigs available.
Informatics tools used:
Sequencing parameters:
For each sample, the following programs were run to generate the data necessary to create this report. The commands are written for unstranded paired-end data. For single-end reads, the R2 files and insert size metrics would be omitted.
java -Xmx1024m TrimmomaticPE -phred33 [raw_sample_R1] [raw_sample_R2] [sample_R1] [sample_R1_unpaired] [sample_R2] [sample_R2_unpaired] HEADCROP:[bases to trim, if any] ILLUMINACLIP:[sample_primer_fasta]:2:30:10 MINLEN:50
fastqc [sample_R1] [sample_R2]
cat [sample_R1/R2] | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(count[read]==1){unique++}};print total,unique,unique*100/total}'
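For readers less familiar with awk, the unique-read calculation above can be sketched in Python. This is an illustrative equivalent, not part of the pipeline: the function name `unique_read_stats` is hypothetical, and the input is assumed to be an uncompressed four-line-per-record FASTQ. As in the awk one-liner, a read counts as "unique" only if its sequence occurs exactly once in the file.

```python
from collections import Counter


def unique_read_stats(fastq_lines):
    """Return (total, unique, percent_unique) for an iterable of FASTQ lines.

    Mirrors the awk one-liner: the sequence is the second line of each
    four-line record, and only the first whitespace-delimited field is used.
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # sequence line of the record
            fields = line.split()
            counts[fields[0] if fields else ""] += 1
    total = sum(counts.values())
    # Sequences seen exactly once are counted as unique
    unique = sum(1 for n in counts.values() if n == 1)
    percent = unique * 100 / total if total else 0.0
    return total, unique, percent
```

A file handle can be passed directly, for example `unique_read_stats(open(path))`, where `path` points to a trimmed FASTQ file.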
The following STAR options were used:
STAR --genomeDir [ref_genome_index] --runThreadN 12 --outReadsUnmapped Fastx --outMultimapperOrder Random --outSAMmultNmax 1 --outFilterIntronMotifs RemoveNoncanonical --outSAMstrandField intronMotif --outSAMtype BAM SortedByCoordinate --readFilesIn [sample_R1] [sample_R2]
Using aligned output files accepted_hits.bam and unmapped.bam:
samtools sort accepted_hits.bam accepted_hits.sorted
samtools index accepted_hits.sorted.bam
samtools idxstats accepted_hits.sorted.bam > accepted_hits.sorted.stats
bamtools stats -in accepted_hits.sorted.bam > accepted_hits.sorted.bamstats
bamtools filter -in accepted_hits.sorted.bam -script cigarN.script | bamtools count
samtools view -c unmapped.bam
java -Xmx2g -jar CollectRnaSeqMetrics.jar REF_FLAT=[ref_flat file] STRAND_SPECIFICITY=NONE INPUT=accepted_hits.bam OUTPUT=RNASeqMetrics
java -Xmx2g -jar CollectInsertSizeMetrics.jar HISTOGRAM_FILE=InsertSizeHist.pdf INPUT=accepted_hits.sorted.bam OUTPUT=InsertSizeMetrics (for paired-end library)
The number of raw reads corresponds to those that passed Casava QC filters, were trimmed to remove adaptors by Trimmomatic, and were aligned by STAR to ref_genome+ERCC transcripts as reported in .info files. Unique read counts were obtained by using awk on trimmed fastq files. FastQC estimates of the percentage of sequences remaining after deduplication were retrieved from fastqc_data.txt files. Bamtools statistics were based on sorted and indexed bam files. The mapped reads were those that mapped to the reference and were output by STAR to accepted_hits.bam. The unmapped reads were output by STAR to unmapped.bam. Some reads may map to multiple locations in the genome, so the total number of reads reported by bamstats may be greater than the number of raw reads. Junction-spanning reads are counted from accepted_hits.bam CIGAR entries containing “N”. Related text files that were saved:
SRP033351_read_counts.txt
SRP033351_duplicates.txt
SRP033351_unique_counts.txt
SRP033351_bamstats_counts.txt
All the QC metrics were saved in the file: SRP033351_qc_metrics_all.txt
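The junction-spanning criterion described above (a CIGAR string containing an N operation) can be illustrated with a short Python sketch. The contents of cigarN.script are not shown in this report, so this is only an assumed equivalent of that filter. It works on SAM text lines (for example, the output of `samtools view`), and the function names are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_junction_spanning(cigar):
    """True if the CIGAR string contains at least one N (skipped region) operation."""
    return any(op == "N" for _, op in CIGAR_OP.findall(cigar))


def count_junction_reads(sam_lines):
    """Count SAM alignment records whose CIGAR (column 6) contains an N operation."""
    n = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 5 and is_junction_spanning(fields[5]):
            n += 1
    return n
```

For instance, an alignment with CIGAR `50M1200N51M` spans a 1200-base skipped region and is counted, while `101M` is not.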
Read counts are shown per million reads.
The Picard Tools RnaSeqMetrics function computes the number of bases assigned to various classes of RNA. It also computes the coverage of bases across all transcripts (normalized to a same-sized reference). Computations are based on comparison to a refFlat file. Related text files that were saved:
SRP033351_rnaseqmetrics_summary.txt
SRP033351_rnaseqmetrics_hist.txt
For paired-end data, the Picard Tools CollectInsertSizeMetrics function was used to compute the distribution of insert sizes in the accepted_hits.bam file and create a histogram. Related text files that were saved:
SRP033351_insertmetrics_summary.txt
Samtools produces a summary document that includes the number of reads mapped to each chromosome. Related text files that were saved:
SRP033351_counts.txt
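As an illustration of how the per-chromosome counts relate to the rRNA and ‘Other’ categories described in the Genome section, the following Python sketch groups `samtools idxstats` output (tab-separated: reference name, length, mapped reads, unmapped reads). The grouping rules are assumptions based on that description: rRNA is chrM plus chrUn_GL000220v1, and ‘Other’ is taken to mean any remaining name that starts with `chrUn_` or ends with `_random`. The function name and the "chromosomes" bucket, which also catches any other reference such as ERCC transcripts, are hypothetical.

```python
RRNA_CONTIGS = {"chrM", "chrUn_GL000220v1"}


def summarize_idxstats(lines):
    """Sum mapped-read counts from idxstats lines into three coarse categories."""
    totals = {"rRNA": 0, "Other": 0, "chromosomes": 0}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and the trailing "*" row (reads with no reference)
        if len(fields) < 3 or fields[0] == "*":
            continue
        name, mapped = fields[0], int(fields[2])
        if name in RRNA_CONTIGS:
            totals["rRNA"] += mapped
        elif name.startswith("chrUn_") or name.endswith("_random"):
            totals["Other"] += mapped
        else:
            totals["chromosomes"] += mapped
    return totals
```

Passing the lines of accepted_hits.sorted.stats to `summarize_idxstats` would return a dictionary with one total per category.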
For samples that contained External RNA Controls Consortium (ERCC) Spike-Ins, dose response curves (i.e. plots of ERCC transcript FPKM vs. ERCC transcript molecules) were created. Ideally, the slope and R² would both equal 1.0.
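To make the slope and R² of a dose response curve concrete, here is a minimal Python sketch of an ordinary least-squares fit. The report does not state how the fit was performed, so two choices here are assumptions: both axes are log2-transformed, and transcripts with zero FPKM or zero molecules are dropped before fitting. The function name `dose_response_fit` is hypothetical.

```python
import math


def dose_response_fit(molecules, fpkm):
    """Fit log2(FPKM) = slope * log2(molecules) + intercept by least squares.

    Returns (slope, intercept, r_squared). Pairs with a non-positive value
    are skipped because their logarithm is undefined.
    """
    pairs = [(math.log2(m), math.log2(f))
             for m, f in zip(molecules, fpkm) if m > 0 and f > 0]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two positive (molecules, FPKM) pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0:
        raise ValueError("all molecule counts are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 equals the squared Pearson correlation
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else float("nan")
    return slope, intercept, r_squared
```

If measured FPKM is exactly proportional to the number of spiked-in molecules, the fit returns a slope of 1 and an R² of 1, which is the ideal case the text describes.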
| PC | Proportion of Variance (%) | Cumulative Proportion of Variance (%) |
|---|---|---|
| PC1 | 40.37 | 40.37 |
| PC2 | 28.32 | 68.68 |
| PC3 | 14.98 | 83.66 |
| PC4 | 10.19 | 93.85 |
| PC5 | 2.187 | 96.04 |
| PC6 | 1.182 | 97.22 |
| PC7 | 0.7979 | 98.02 |
| PC8 | 0.5397 | 98.56 |
| PC9 | 0.3911 | 98.95 |
| PC10 | 0.2602 | 99.21 |
PCA plots are generated from the first two principal components and colored by known factors (e.g. treatment/disease conditions, tissue, and donors), visualizing similarities between samples and how these similarities correlate with batch effects.
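The two columns of the proportion-of-variance table are related by simple arithmetic, sketched below in Python. Each component's variance is divided by the total variance, and the cumulative column is the running sum. The per-component variances are not given in this report, so the function name and the input values in the example are purely illustrative.

```python
def variance_proportions(variances):
    """Return (proportions, cumulative) in percent for per-component variances."""
    total = sum(variances)
    if total <= 0:
        raise ValueError("total variance must be positive")
    proportions = [100 * v / total for v in variances]
    cumulative = []
    running = 0.0
    for p in proportions:
        running += p  # cumulative proportion up to and including this PC
        cumulative.append(running)
    return proportions, cumulative
```

For example, hypothetical variances of 4, 3, 2 and 1 give proportions of 40%, 30%, 20% and 10%, and cumulative proportions of 40%, 70%, 90% and 100%.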
The number of reads that could not be mapped to any feature (Nofeature count) is shown per million reads, based on htseq-count quantification results.
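A possible way to derive this value from an htseq-count output file is sketched below in Python. htseq-count writes tab-separated feature and count pairs, followed by special counters such as `__no_feature`. The report does not say which denominator was used for the per-million scaling. This sketch assumes the sum of all counts in the file, including the special counters, and the function name is hypothetical.

```python
def no_feature_per_million(htseq_lines):
    """Return the __no_feature count scaled per million of all counted reads.

    Assumption: the denominator is the sum of every count in the file,
    including htseq-count's special "__" counters.
    """
    total = 0
    no_feature = 0
    for line in htseq_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        name, count = fields[0], int(fields[1])
        total += count
        if name == "__no_feature":
            no_feature = count
    if total == 0:
        raise ValueError("no counts found")
    return no_feature * 1_000_000 / total
```

If a different denominator applies (for example, the number of raw reads), only the final division would need to change.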
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8, LC_COLLATE=en_US.UTF-8, LC_MONETARY=en_US.UTF-8, LC_MESSAGES=en_US.UTF-8, LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_US.UTF-8 and LC_IDENTIFICATION=C
attached base packages: parallel, stats4, stats, graphics, grDevices, utils, datasets, methods and base
other attached packages: DESeq2(v.1.28.1), SummarizedExperiment(v.1.18.2), DelayedArray(v.0.14.1), matrixStats(v.0.57.0), Biobase(v.2.48.0), GenomicRanges(v.1.40.0), GenomeInfoDb(v.1.24.2), IRanges(v.2.22.2), S4Vectors(v.0.26.1), BiocGenerics(v.0.34.0), knitr(v.1.30), ggplot2(v.3.3.2), DT(v.0.16), RColorBrewer(v.1.1-2), pander(v.0.6.3), tidyr(v.1.1.2) and rmarkdown(v.2.5)
loaded via a namespace (and not attached): locfit(v.1.5-9.4), Rcpp(v.1.0.5), lattice(v.0.20-41), digest(v.0.6.25), R6(v.2.4.1), RSQLite(v.2.2.1), evaluate(v.0.14), pillar(v.1.4.6), zlibbioc(v.1.34.0), rlang(v.0.4.7), annotate(v.1.66.0), blob(v.1.2.1), Matrix(v.1.2-18), labeling(v.0.3), splines(v.4.0.2), BiocParallel(v.1.22.0), geneplotter(v.1.66.0), stringr(v.1.4.0), htmlwidgets(v.1.5.2), bit(v.4.0.4), RCurl(v.1.98-1.2), munsell(v.0.5.0), compiler(v.4.0.2), xfun(v.0.19), pkgconfig(v.2.0.3), mgcv(v.1.8-31), htmltools(v.0.5.0), tidyselect(v.1.1.0), tibble(v.3.0.3), GenomeInfoDbData(v.1.2.3), XML(v.3.99-0.5), crayon(v.1.3.4), dplyr(v.1.0.2), withr(v.2.3.0), bitops(v.1.0-6), grid(v.4.0.2), nlme(v.3.1-148), jsonlite(v.1.7.1), xtable(v.1.8-4), gtable(v.0.3.0), lifecycle(v.0.2.0), DBI(v.1.1.0), magrittr(v.1.5), scales(v.1.1.1), stringi(v.1.5.3), farver(v.2.0.3), XVector(v.0.28.0), genefilter(v.1.70.0), ellipsis(v.0.3.1), generics(v.0.0.2), vctrs(v.0.3.4), tools(v.4.0.2), bit64(v.4.0.5), glue(v.1.4.2), purrr(v.0.3.4), crosstalk(v.1.1.0.1), survival(v.3.1-12), yaml(v.2.2.1), AnnotationDbi(v.1.50.3), colorspace(v.1.4-1) and memoise(v.1.1.0)